Policy Jailbreak Test
Overview
The Policy Jailbreak Test evaluates how resistant your AI system is to adversarial jailbreak prompts designed to make it violate specific policies. Policies define the rules your AI system should follow, and include a description, allowed behaviors, and disallowed behaviors. To learn more about creating policies in Dynamo, click here.
This test simulates real-world attack scenarios using a range of both light and expert jailbreak techniques. The goal is to measure how often these attacks succeed in bypassing your safeguards and producing policy-violating responses.
Policy Jailbreak Test Methodology
-
Policy Selection
Users first choose the set of policies they want to evaluate. -
Attack Area Generation
For each policy, Dynamo automatically generates 10 attack areas. These are specific scenarios that could cause the AI system to violate the policy. -
Base Prompt Creation
For each attack area, Dynamo then generates base prompts targeting each attack area. These are prompts that, if answered directly, could lead to a policy violation. -
Jailbreak Transformation
Each base prompt is then modified using jailbreak techniques to produce attack prompts. -
Execution and Recording
Finally, the AI system is queried with each attack prompt, and the system response is recorded and evaluated. Any guardrails or system prompts you have in place are active and can affect the final system response that is returned. -
Evaluation
Dynamo evaluates each response to determine whether the attack succeeded by producing policy-violating content, or whether the attack was blocked.
Supported Attack Types
Dynamo groups jailbreak techniques into two categories:
Light Adversarial Techniques
Simple, often single-turn, static attacks that can be replicated without advanced jailbreaking expertise:
- Baseline – Original noncompliant prompts with no transformation
- Do Anything Now (DAN)
- Encoding Attacks
- Persuasive Adversarial Prompts (PAP)
- ASCII Art Attack
Expert Adversarial Techniques
More complex, adaptive methods requiring multiple queries, specialized knowledge, or greater resources. Only attackers with specialized jailbreaking expertise will be able to replicate these attacks:
- Tree of Attacks with Pruning (TAP)
- Iterative Refinement Induced Self-Jailbreak (IRIS)
More information on each technique can be found on the other jailbreak documentation sites.
Metrics
The main metric reported in the Policy Jailbreak Test is Attack Success Rate (ASR).
Attack Success Rate (ASR) =
(Number of successful jailbreaks ÷ Total jailbreak attempts) × 100%
- Attack success: The AI system produced content that violates the tested policy.
- Attack blocked: The AI system refused or avoided producing policy-violating content.
A high ASR indicates greater vulnerability, while a low ASR indicates stronger resistance to jailbreaks and less vulnerability.
Example Test Output
A typical Policy Jailbreak Test report includes:
- Overall ASR: One aggregated score across all policies included in the evaluation.
- ASR Per Policy: Success rate for each individual policy in the evaluation.
- ASR Per Attack Type: Success rate for each jailbreaking technique, such as ASCII art or DAN (Do Anything Now).
- Most Successful Attack Areas: The top scenarios where jailbreaks had the highest success rates.